Written notes - page XXX*
Effects of starvation and humidity on water loss in the confused flour beetle. Here, looking at the relationship between humidity and weight loss
flr_beetle <- read.csv(file = "https://mlammens.github.io/Biostats/data/Logan_Examples/Chapter8/Data/nelson.csv")
flr_beetle
## HUMIDITY WEIGHTLOSS
## 1 0.0 8.98
## 2 12.0 8.14
## 3 29.5 6.67
## 4 43.0 6.08
## 5 53.0 5.90
## 6 62.5 5.83
## 7 75.5 4.68
## 8 85.0 4.20
## 9 93.0 3.72
Plot these data
library(ggplot2)
ggplot() +
geom_point(data = flr_beetle, aes( x = HUMIDITY, y = WEIGHTLOSS)) +
theme_bw()
Run a linear regression
flr_beetle_lm <- lm(data = flr_beetle, WEIGHTLOSS ~ HUMIDITY)
summary(flr_beetle_lm)
##
## Call:
## lm(formula = WEIGHTLOSS ~ HUMIDITY, data = flr_beetle)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46397 -0.03437 0.01675 0.07464 0.45236
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.704027 0.191565 45.44 6.54e-10 ***
## HUMIDITY -0.053222 0.003256 -16.35 7.82e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2967 on 7 degrees of freedom
## Multiple R-squared: 0.9745, Adjusted R-squared: 0.9708
## F-statistic: 267.2 on 1 and 7 DF, p-value: 7.816e-07
Plot these data, with lm fit
ggplot(data = flr_beetle, aes( x = HUMIDITY, y = WEIGHTLOSS)) +
geom_point() +
stat_smooth(method = "lm") +
theme_bw()
Leverage: a measure of how extreme a value in x-space is (i.e., along the x-axis) and how much of an influence each \(x_i\) has on \(\hat{y}_i\). High leverage values indicate that model is unduly influenced by an outlying value.
Residuals: the differences between the observed \(y_i\) values and the predicted \(\hat{y}_i\) values. Looking at the pattern between residuals and \(\hat{y}_i\) values can yield important information regarding violation of model assumptions (e.g., homogeneity of variance).
Cook’s D: a statistics that offers a measure of the influence of each data point on the fitted model that incorporates both leverage and residuals. Values \(\ge 1\) are considered “suspect”.
plot(flr_beetle_lm)
Regression coefficients are statistics and thus we can determine the standard error of these statistics. From there, we can use these values and the t-distribution to determine confidence intervals. For example, the confidence interval for \(b_1\) is:
\[ b_1 \pm t_{0.05, n-2}s_{b_{1}} \]
\[ s_{b_{1}} = \sqrt{ \frac{MS_{Residual}}{\sum^n_{i=1}(x_i-\bar{x})^2} } \]
\[ s_{b_{0}} = \sqrt{ MS_{Residual} [\frac{1}{n} + \frac{\bar{x}^2}{\sum^n_{i=1}(x_i-\bar{x})^2}] } \]
From Quinn and Keough, p. 87:
The 95% confidence band is a biconcave band that will contain the true population regression line 95% of the time.
\[ s_{\hat{y}} = \sqrt{ MS_{Residual} [1 + \frac{1}{n} + \frac{x_p - \bar{x}^2}{\sum^n_{i=1}(x_i-\bar{x})^2}] } \]
where \(x_p\) is the value from \(x\) we are “predicting” a \(y\) value for.
In order to determine if your linear regression model is meaningful (or significant) you must compare the variance explained by your model versus the variance unexplained.
Note that there is a typo in this figure in panel (c). Instead of “Explained variability”, the arrow tag should be “Unexplained variability”.
Below is a list of techniques that are covered well in both of the class textbooks.
What do we do when we have two or more predictors?
See written notes – page 2
Also, look at this totally awesome site! Visualizing Relations in Multiple Regression
Only the predictors themselves are considered.
\[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + ... + \beta_j x_{ij} + \epsilon_i \]
The predictors and the interactions between predictors are considered.
\[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1} x_{i2} + ... + \beta_j x_{ij} + \epsilon_i \]
where \(\beta_3 x_{i1} x_{i2}\) is the interactive effect of \(X_1\) and \(X_2\) on \(Y\) and it examines the degree to which the effect of one of the predictor variables depends on the levels of the other (Logan p. 209).
NOTE: There are many more coefficients that need to be estimated in the multiplicative model!
Temperature in June and temperature in July
Temperature in June and rainfall in June
Loyn (1987) investigated the effects of habitat fragmentation on abundance of forest birds. He considered multiple predictor variables associated with fragmentation or other important environmental conditions.
Get data and examine.
bird_frag <- read.csv(file = "https://mlammens.github.io/Biostats/data/Logan_Examples/Chapter9/Data/loyn.csv")
summary(bird_frag)
## ABUND AREA YR.ISOL DIST
## Min. : 1.50 Min. : 0.10 Min. :1890 Min. : 26.0
## 1st Qu.:12.40 1st Qu.: 2.00 1st Qu.:1928 1st Qu.: 93.0
## Median :21.05 Median : 7.50 Median :1962 Median : 234.0
## Mean :19.51 Mean : 69.27 Mean :1950 Mean : 240.4
## 3rd Qu.:28.30 3rd Qu.: 29.75 3rd Qu.:1966 3rd Qu.: 333.2
## Max. :39.60 Max. :1771.00 Max. :1976 Max. :1427.0
## LDIST GRAZE ALT
## Min. : 26.0 Min. :1.000 Min. : 60.0
## 1st Qu.: 158.2 1st Qu.:2.000 1st Qu.:120.0
## Median : 338.5 Median :3.000 Median :140.0
## Mean : 733.3 Mean :2.982 Mean :146.2
## 3rd Qu.: 913.8 3rd Qu.:4.000 3rd Qu.:182.5
## Max. :4426.0 Max. :5.000 Max. :260.0
Plot these data.
library(GGally)
ggpairs(bird_frag)
We can clearly see here that some of our predictor variables are not normally distributed, particularly AREA, and to a lesser extent DIST and LDIST. We will try to transform these data, to see if we can get them to look a bit more normal.
bird_frag_transform <- bird_frag
bird_frag_transform$AREA <- log10(bird_frag$AREA)
bird_frag_transform$DIST <- log10(bird_frag$DIST)
bird_frag_transform$LDIST <- log10(bird_frag$LDIST)
Replot.
ggpairs(bird_frag_transform)
We also see here that there are no extremely high correlations among the predictor variables. How we define “extremely high” is relatively arbitrary, but generally anything greater than 0.8.
To be safe, we’ll examine the VIFs.
library(car)
vif(lm(data = bird_frag, ABUND ~ log10(AREA) + YR.ISOL + log10(DIST) + log10(LDIST) + GRAZE + ALT))
## log10(AREA) YR.ISOL log10(DIST) log10(LDIST) GRAZE
## 1.911514 1.804769 1.654553 2.009749 2.524814
## ALT
## 1.467937
And, we’re good.
Now run the multiple linear regression model.
bird_frag_lm <- lm(data = bird_frag, ABUND ~ log10(AREA) + YR.ISOL + log10(DIST) + log10(LDIST) + GRAZE + ALT)
Diagnostic plots.
plot(bird_frag_lm)
Looks good!
Now look at the model summary.
summary(bird_frag_lm)
##
## Call:
## lm(formula = ABUND ~ log10(AREA) + YR.ISOL + log10(DIST) + log10(LDIST) +
## GRAZE + ALT, data = bird_frag)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.6506 -2.9390 0.5289 2.5353 15.2842
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -125.69725 91.69228 -1.371 0.1767
## log10(AREA) 7.47023 1.46489 5.099 5.49e-06 ***
## YR.ISOL 0.07387 0.04520 1.634 0.1086
## log10(DIST) -0.90696 2.67572 -0.339 0.7361
## log10(LDIST) -0.64842 2.12270 -0.305 0.7613
## GRAZE -1.66774 0.92993 -1.793 0.0791 .
## ALT 0.01951 0.02396 0.814 0.4195
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.384 on 49 degrees of freedom
## Multiple R-squared: 0.6849, Adjusted R-squared: 0.6464
## F-statistic: 17.75 on 6 and 49 DF, p-value: 8.443e-11
What if we wanted to look at each partial regression plot? Use av.plots from the car package.
avPlots(bird_frag_lm, ask = F)
Another way to approximately visualize these relationships.
library(reshape)
bird_frag_transform_melt <- melt(bird_frag_transform, id.vars = c("ABUND"))
ggplot(data = bird_frag_transform_melt, aes(x = value, y = ABUND)) +
geom_point() +
stat_smooth(method = "lm") +
facet_wrap( ~variable, scales = "free" )
Paruelo and Lauenroth (1996) examined the relationships between the abundance of \(C_3\) plants (those that use \(C_3\) photosynthesis) and geographic and climactic variables. Here we are only going to consider the geographic variables.
Get data and examine.
c3_plants <- read.csv(file = "https://mlammens.github.io/Biostats/data/Logan_Examples/Chapter9/Data/paruelo.csv")
summary(c3_plants)
## C3 MAP MAT JJAMAP
## Min. :0.0000 Min. : 117 Min. : 2.000 Min. :0.1000
## 1st Qu.:0.0500 1st Qu.: 345 1st Qu.: 6.900 1st Qu.:0.2000
## Median :0.2100 Median : 421 Median : 8.500 Median :0.2900
## Mean :0.2714 Mean : 482 Mean : 9.999 Mean :0.2884
## 3rd Qu.:0.4700 3rd Qu.: 575 3rd Qu.:12.900 3rd Qu.:0.3600
## Max. :0.8900 Max. :1011 Max. :21.200 Max. :0.5100
## DJFMAP LONG LAT
## Min. :0.1100 Min. : 93.2 Min. :29.00
## 1st Qu.:0.1500 1st Qu.:101.8 1st Qu.:36.83
## Median :0.2000 Median :106.5 Median :40.17
## Mean :0.2275 Mean :106.4 Mean :40.10
## 3rd Qu.:0.3100 3rd Qu.:111.8 3rd Qu.:43.95
## Max. :0.4900 Max. :119.5 Max. :52.13
Again, we’re only going to consider the geographic variable - lattitude and longitude (LONG and LAT).
Have a look at these data in graphical format.
ggpairs(c3_plants)
C3 abundance is not normally distributed. Let’s convert using a log10(C3 + 0.1) conversion. The + 0.1 is needed because the log of 0 is negative infinity.
Also, we should center the lat and long data.
c3_plants$LAT <- scale(c3_plants$LAT, scale = FALSE)
c3_plants$LONG <- scale(c3_plants$LONG, scale = FALSE)
We can create a model with an interaction term in two ways. They yield identical results.
c3_plants_lm1 <- lm(data = c3_plants, log10(C3 + 0.1) ~ LONG + LAT + LONG:LAT)
c3_plants_lm2 <- lm(data = c3_plants, log10(C3 + 0.1) ~ LONG*LAT)
Check out the summary results of both.
summary(c3_plants_lm1)
##
## Call:
## lm(formula = log10(C3 + 0.1) ~ LONG + LAT + LONG:LAT, data = c3_plants)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.54185 -0.13298 -0.02287 0.16807 0.43410
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.5529416 0.0274679 -20.130 < 2e-16 ***
## LONG -0.0025787 0.0043182 -0.597 0.5523
## LAT 0.0483954 0.0057047 8.483 2.61e-12 ***
## LONG:LAT 0.0022522 0.0008757 2.572 0.0123 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2334 on 69 degrees of freedom
## Multiple R-squared: 0.5137, Adjusted R-squared: 0.4926
## F-statistic: 24.3 on 3 and 69 DF, p-value: 7.657e-11
summary(c3_plants_lm2)
##
## Call:
## lm(formula = log10(C3 + 0.1) ~ LONG * LAT, data = c3_plants)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.54185 -0.13298 -0.02287 0.16807 0.43410
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.5529416 0.0274679 -20.130 < 2e-16 ***
## LONG -0.0025787 0.0043182 -0.597 0.5523
## LAT 0.0483954 0.0057047 8.483 2.61e-12 ***
## LONG:LAT 0.0022522 0.0008757 2.572 0.0123 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2334 on 69 degrees of freedom
## Multiple R-squared: 0.5137, Adjusted R-squared: 0.4926
## F-statistic: 24.3 on 3 and 69 DF, p-value: 7.657e-11
Look at diagnostic plots.
plot(c3_plants_lm1)
Look at non-interaction model.
c3_plants_lm_noint <- lm(data = c3_plants, log10(C3 + 0.1) ~ LONG + LAT)
summary(c3_plants_lm_noint)
##
## Call:
## lm(formula = log10(C3 + 0.1) ~ LONG + LAT, data = c3_plants)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50434 -0.20010 -0.02084 0.20813 0.43932
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.545622 0.028395 -19.216 < 2e-16 ***
## LONG -0.003737 0.004464 -0.837 0.405
## LAT 0.042430 0.005417 7.833 3.71e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2426 on 70 degrees of freedom
## Multiple R-squared: 0.4671, Adjusted R-squared: 0.4519
## F-statistic: 30.68 on 2 and 70 DF, p-value: 2.704e-10
anova(c3_plants_lm_noint,c3_plants_lm1)
## Analysis of Variance Table
##
## Model 1: log10(C3 + 0.1) ~ LONG + LAT
## Model 2: log10(C3 + 0.1) ~ LONG + LAT + LONG:LAT
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 70 4.1200
## 2 69 3.7595 1 0.36043 6.615 0.01227 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1